Introduction

This Rmarkdown script (and corresponding TAB-separated CSV input data file InfoRateData.csv and the resulting HTML document) contain the full analysis and plotting code accompanying the paper Human languages share an optimal information transmission rate.

The data

For more information on the data, please see Oh (2015). There are in total 17 languages (see the Table below).

The Oral Corpus

The oral corpus is based on a subset of the Multext (Multilingual Text Tools and Corpora) parallel corpus (Campione & Véronis, 1998) in British English, German, and Italian. The material consists of 15 short texts of 3-5 semantically connected sentences carefully translated by a native speaker in each language.

For the other 14 languages, two of the authors supervised the translation and recording of new datasets. All participants were native speakers of the target language, with a focus on a specific variety of the language when possible – e.g. Mandarin spoken in Beijing, Serbian in Belgrade and Korean in Seoul. No strict control on age or on the speakers’ social diversity was performed, but speakers were mainly students or members of academic institutions. Speakers were asked to read three times (first silently and then loudly twice) each text. The texts were presented one by one on the screen in random order, in a self-paced reading paradigm. This way, speakers familiarized themselves with the text and reduce their reading errors. The second loud recording was analyzed in this study.

Text Corpus

Text datasets were acquired from various sources as illustrated in the Table below. After an initial data curation, each dataset was phonetically transcribed and automatically syllabified by a rule-based program written by one of the authors, except in the following cases:

  1. when syllabification was already provided with the dataset (English, French, German, and Vietnamese for the multisyllabic words);
  2. when the corpus was syllabified by an automatic grapheme-to-phoneme converter (Catalan, Spanish, and Thai).

Additionally, no syllabification was required for Sino-Tibetan languages (Cantonese and Mandarin Chinese) since one ideogram corresponds to one syllable.

Language Family ISO 639-3 Corpus
Basque Basque EUS E-Hitz (Perea et al., 2006)
British English Indo-European ENG WebCelex (MPI for Psycholinguistics)
Cantonese Sino-Tibetan YUE A linguistic corpus of mid-20th century Hong Kong Cantonese
Catalan Indo-European CAT Frequency dictionary (Zséder et al., 2012)
Finnish Uralic FIN Finnish Parole Corpus
French Indo-European FRA Lexique 3.80 (New et al., 2001)
German Indo-European DEU WebCelex (MPI for Psycholinguistics)
Hungarian Uralic HUN Hungarian National Corpus (Váradi, 2002)
Italian Indo-European ITA The Corpus PAISÀ (Lyding et al., 2014)
Japanese Japanese JPN Japanese Internet Corpus (Sharoff, 2006)
Korean Korean KOR Leipzig Corpora Collection (LCC)
Mandarin Chinese Sino-Tibetan CMN Chinese Internet Corpus (Sharoff, 2006)
Serbian Indo-European SRP Frequency dictionary (Zséder et al., 2012)
Spanish Indo-European SPA Frequency dictionary (Zséder et al., 2012)
Thai Tai-Kadai THA Thai National Corpus (TNC)
Turkish Turkic TUR Leipzig Corpora Collection (LCC)
Vietnamese Austroasiatic VIE VNSpeechCorpus (Le et al., 2004)

Dataset structure

The data is structured as follows:

  • Language is the unique language ISO 639-3 ID with 17 possible values “CAT”, “CMN”, “DEU”, “ENG”, “EUS”, “FIN”, “FRA”, “HUN”, “ITA”, “JPN”, “KOR”, “SPA”, “SRP”, “THA”, “TUR”, “VIE”, “YUE”;
  • Language family is the language family with 9 possible values “Austroasiatic”, “Basque”, “Indo-European”, “Japanese”, “Korean”, “Sino-Tibetan”, “Tai-Kadai”, “Turkic”, “Uralic”;
  • Text is the text identifier (similar across languages and speakers) with 15 possible values “O1”, “O2”, “O3”, “O4”, “O6”, “O8”, “O9”, “P0”, “P1”, “P2”, “P3”, “P8”, “P9”, “Q0”, “Q1”;
  • Speaker is the unique speaker’s ID (there are 170 in the dataset);
  • Sex is the speaker’s sex;
  • Age is the speaker’s age (available only for a subset of 132 speakers);
  • Duration is the texts’ duration in seconds as spoken by the given speaker (pauses longer than 150ms being excluded);
  • NS is the texts’ number of of syllables (the same for a given text in a given language);
  • SR is the speech rate (syllables/second) for a given text spoken by a given speaker;
  • ShE (resp. ID) is the first-order (resp. second-order) language-level entropy estimate;
  • ShIR and IR are text * speaker properties obtained by multiplying SR by ShE and ID respectively.

Technical notes on regressions

We use throughout sum contrasts for the factor IVs, which are orthogonal contrasts which compare every level of the IV to the overall mean (for example, for a two-levels factor such as Sex we do not compare Males with Females but each with their overall mean, which is included in the intercept). However, in R the contr.sum() function used to define this contrasts produces level names that are very uninformative, so we explicit these below (please note that in the model outputs the last level is usually not shown):

  • Sex: Sex1 = F, Sex2 = M (the last level, Sex2 is usually not displayed);
  • Text: Text1 = O1, Text2 = O2, Text3 = O3, Text4 = O4, Text5 = O6, Text6 = O8, Text7 = O9, Text8 = P0, Text9 = P1, Text10 = P2, Text11 = P3, Text12 = P8, Text13 = P9, Text14 = Q0, Text15 = Q1 (the last level, Text15 is usually not displayed);
  • Language: Language1 = CAT, Language2 = CMN, Language3 = DEU, Language4 = ENG, Language5 = EUS, Language6 = FIN, Language7 = FRA, Language8 = HUN, Language9 = ITA, Language10 = JPN, Language11 = KOR, Language12 = SPA, Language13 = SRP, Language14 = THA, Language15 = TUR, Language16 = VIE, Language17 = YUE (the last level, Language17 is usually not displayed);
  • Family: Family1 = Austroasiatic, Family2 = Basque, Family3 = Indo-European, Family4 = Japanese, Family5 = Korean, Family6 = Sino-Tibetan, Family7 = Tai-Kadai, Family8 = Turkic, Family9 = Uralic (the last level, Family9 is usually not displayed);

Exploratory plots and summaries

Speaker characteristics

Distribution of speaker number and characteristics (sex and age) by language. Lng=language, # spkrs=number of speakers, % fem=percent female speakers, # age=number of speakers with age info, and for those, mean(age)=mean age, sd(age)=standard deviation of age, and actual ages=the sorted ages.
Lng # spkrs % fem # age mean(age) sd(age) actual ages
CAT 10 50 10 35.4 9.2 (21, 28, 28, 29, 31, 39, 42, 42, 44, 50)
CMN 10 50 9 23.1 4.5 (19, 19, 19, 19, 24, 24, 25, 28, 31)
DEU 10 50 0 NaN NaN ()
ENG 10 50 0 NaN NaN ()
EUS 10 50 10 28.0 4.9 (19, 22, 26, 27, 28, 29, 30, 31, 32, 36)
FIN 10 50 10 33.2 11.0 (16, 22, 26, 28, 30, 35, 37, 41, 45, 52)
FRA 10 50 10 32.5 7.7 (24, 25, 25, 27, 28, 36, 36, 37, 41, 46)
HUN 10 50 10 39.3 15.8 (17, 27, 27, 31, 33, 39, 42, 51, 57, 69)
ITA 10 50 0 NaN NaN ()
JPN 10 50 10 30.6 12.8 (20, 20, 21, 22, 22, 28, 29, 40, 51, 53)
KOR 10 50 10 28.6 10.6 (16, 19, 19, 19, 28, 31, 33, 35, 36, 50)
SPA 10 50 10 33.7 10.1 (21, 22, 26, 28, 30, 32, 42, 42, 44, 50)
SRP 10 50 10 30.6 7.8 (19, 21, 23, 30, 31, 32, 34, 34, 38, 44)
THA 10 50 10 30.1 5.7 (23, 23, 27, 28, 30, 31, 31, 32, 33, 43)
TUR 10 50 7 32.6 7.2 (24, 25, 30, 31, 37, 37, 44)
VIE 10 50 6 27.2 4.1 (21, 25, 26, 28, 31, 32)
YUE 10 50 10 22.0 1.5 (20, 20, 21, 21, 22, 22, 23, 23, 24, 24)

NS

NS: exploratory plots.

NS: exploratory plots.

mean=101.196, median=100, sd=24.703, CV=0.244, min=49, max=162, kurtosis=2.484, skewness=0.177.

SR

SR: exploratory plots.

SR: exploratory plots.

SR per speaker.

SR per speaker.

SR by Sex and Age across Languages.

SR by Sex and Age across Languages.

SR by Sex, Age and Language.

SR by Sex, Age and Language.

SR by language.

SR by language.

mean=6.631, median=6.777, sd=1.148, CV=0.173, min=3.589, max=9.492, kurtosis=2.408, skewness=-0.168.

ShE and ID

ShE and ID: exploratory plots.

ShE and ID: exploratory plots.

ShE vs ID.

ShE vs ID.


    Pearson's product-moment correlation

data:  tmp1$ShE and tmp1$ID
t = 2.0326, df = 15, p-value = 0.06019
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.02052914  0.77274779
sample estimates:
      cor 
0.4647009 

    Spearman's rank correlation rho

data:  tmp1$ShE and tmp1$ID
S = 451.88, p-value = 0.07259
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.4462208 

    Paired t-test

data:  tmp1$ShE and tmp1$ID
t = 11.635, df = 16, p-value = 3.213e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 2.158040 3.119607
sample estimates:
mean of the differences 
               2.638824 

ShE:

mean=8.621, median=8.69, sd=0.904, CV=0.105, min=6.07, max=9.83, kurtosis=4.665, skewness=-1.122.

ID:

mean=6.009, median=5.56, sd=0.883, CV=0.147, min=4.83, max=8.02, kurtosis=2.53, skewness=0.747.

ShIR and IR

ShIR: exploratory plots.

ShIR: exploratory plots.

ShIR per speaker.

ShIR per speaker.

ShIR by Sex and Age across Languages.

ShIR by Sex and Age across Languages.

ShIR by Sex, Age and Language.

ShIR by Sex, Age and Language.

ShIR by language.

ShIR by language.

IR: exploratory plots.

IR: exploratory plots.

IR per speaker.

IR per speaker.

IR by Sex and Age across Languages.

IR by Sex and Age across Languages.

IR by Sex, Age and Language.

IR by Sex, Age and Language.

IR by language.

IR by language.

ShIR:

mean=56.709, median=57.207, sd=9.35, CV=0.165, min=32.772, max=89.235, kurtosis=2.444, skewness=0.079.

IR:

mean=39.153, median=39.13, sd=5.097, CV=0.13, min=25.631, max=60.692, kurtosis=3.622, skewness=0.325.

SR and ID

SR vs ID

SR vs ID


    Pearson's product-moment correlation

data:  info.rate.data$SR and info.rate.data$ID
t = -45.329, df = 2286, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7090031 -0.6658066
sample estimates:
       cor 
-0.6880138 

    Spearman's rank correlation rho

data:  info.rate.data$SR and info.rate.data$ID
S = 3393600000, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.6999614 

Linear Mixed Modells (LMMs)

Linear ICCs

ICC for model NS ~ 1 + (1 | Text) + (1 | Language)+ (1 | Speaker).
Level-1 factor (f) ICC
Text 0.26
Family 0.42
Language 0.19
Speaker 0.00
ICC for model SR ~ 1 + (1 | Text) + (1 | Language)+ (1 | Speaker).
Level-1 factor (f) ICC
Text 0.01
Family 0.50
Language 0.19
Speaker 0.24
ICC for model ShIR ~ 1 + (1 | Text) + (1 | Language)+ (1 | Speaker).
Level-1 factor (f) ICC
Text 0.01
Family 0.47
Language 0.13
Speaker 0.30
ICC for model IR ~ 1 + (1 | Text) + (1 | Language)+ (1 | Speaker).
Level-1 factor (f) ICC
Text 0.02
Family 0.03
Language 0.32
Speaker 0.49

SR

Comparing models

AIC and BIC for various hierarchical models of SR (in bold are the minimum values).
model AIC BIC
1 + (1 | Text) + (1 | Family/Language) + (1 | Speaker) 2127.51 2161.93
1 + (1 | Family/Language) + (1 | Speaker) 2397.97 2426.65
1 + (1 | Text) + (1 | Speaker) 2257.73 2280.68
1 + (1 | Text) + (1 | Family/Language) 4717.06 4745.73
1 + Sex + (1 | Text) + (1 | Family/Language) + (1 | Speaker) 2121.04 2161.19
1 + Sex + (1 | Family/Language) + (1 | Speaker) 2391.29 2425.7
1 + Sex + (1 | Text) + (1 | Speaker) 2258.63 2287.31
1 + Sex + (1 | Text) + (1 | Family/Language) 4584.11 4618.52

Residuals and random effects

We consider here the full model SR ~ 1 + Sex + (1|Text) + (1|Family/Language) + (1|Speaker).

IR

Comparing models

AIC and BIC for various hierarchical models of IR (in bold are the minimum values).
model AIC BIC
1 + (1 | Text) + (1 | Family/Language) + (1 | Speaker) 10257.84 10292.25
1 + (1 | Family/Language) + (1 | Speaker) 10524.5 10553.18
1 + (1 | Text) + (1 | Speaker) 10305.76 10328.71
1 + (1 | Text) + (1 | Family/Language) 12888.49 12917.17
1 + Sex + (1 | Text) + (1 | Family/Language) + (1 | Speaker) 10247.34 10287.49
1 + Sex + (1 | Family/Language) + (1 | Speaker) 10513.79 10548.21
1 + Sex + (1 | Text) + (1 | Speaker) 10300.05 10328.72
1 + Sex + (1 | Text) + (1 | Family/Language) 12747.04 12781.45

Residuals and random effects

We consider here the full model IR ~ 1 + Sex + (1|Text) + (1|Family/Language) + (1|Speaker).

Generalized Additive Models for Location, Scale and Shape (GAMLSS)

We will use a Gaussian distribution (with fixed or modelled variance).

SR

Fixed or modelled σ (variance)

Fixed σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  -7.131581e-05 
                       variance   =  1.000437 
               coef. of skewness  =  0.04303609 
               coef. of kurtosis  =  3.710291 
Filliben correlation coefficient  =  0.9979187 
******************************************************************


Deviance= 1177.818 

AIC= 1576.98 
Modelled σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.002001851 
                       variance   =  1.000433 
               coef. of skewness  =  0.03014483 
               coef. of kurtosis  =  2.864355 
Filliben correlation coefficient  =  0.9994151 
******************************************************************


Deviance= 815.9172 

AIC= 1405.705 

The distribution of the residuals is less heteroscedastic than before and the fit to the data better. The full summary of the model is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = SR ~ 1 + Sex + random(Text) + random(Language) +      random(Family) + random(Speaker), sigma.formula = ~1 +      Sex + random(Text) + random(Language) + random(Family) +  
    random(Speaker), family = NO(mu.link = "identity"),      data = d, control = gamlss.control(n.cyc = 800,          trace = FALSE), i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.629552   0.005816 1139.83   <2e-16 ***
Sex1        -0.168157   0.005816  -28.91   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.23693    0.01478 -83.673  < 2e-16 ***
Sex1        -0.05788    0.01478  -3.915 9.34e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  2288 
Degrees of Freedom for the fit:  294.8937
      Residual Deg. of Freedom:  1993.106 
                      at cycle:  53 
 
Global Deviance:     815.9172 
            AIC:     1405.705 
            SBC:     3097.048 
******************************************************************

Random effects

μ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.39185 
Random effect parameter sigma_b: 0.109103 
Smoothing parameter lambda     : 84.5402 

Language

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 16.98344 
Random effect parameter sigma_b: 0.870621 
Smoothing parameter lambda     : 1.32916 

Family

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.01237646 
Random effect parameter sigma_b: 1.91981e-05 
Smoothing parameter lambda     : 2713240000 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 165.4211 
Random effect parameter sigma_b: 0.569042 
Smoothing parameter lambda     : 3.32891 
σ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.00152528 
Random effect parameter sigma_b: 0.000518285 
Smoothing parameter lambda     : 3475370 

Language

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.62296 
Random effect parameter sigma_b: 0.153319 
Smoothing parameter lambda     : 39.9698 

Family

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.001713103 
Random effect parameter sigma_b: 0.000184844 
Smoothing parameter lambda     : 27323100 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 79.45882 
Random effect parameter sigma_b: 0.181484 
Smoothing parameter lambda     : 29.3639 

IR

Fixed or modelled σ (variance)

Fixed σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  8.557955e-05 
                       variance   =  1.000437 
               coef. of skewness  =  0.09721211 
               coef. of kurtosis  =  3.684191 
Filliben correlation coefficient  =  0.9978017 
******************************************************************


Deviance= 9322.636 

AIC= 9721.861 
Modelled σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.002104815 
                       variance   =  1.000442 
               coef. of skewness  =  0.03139581 
               coef. of kurtosis  =  2.864031 
Filliben correlation coefficient  =  0.9993924 
******************************************************************


Deviance= 8961.553 

AIC= 9554.321 

Again, this is a better fit to the data. The full summary of the model is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = IR ~ 1 + Sex + random(Text) + random(Language) +      random(Family) + random(Speaker), sigma.formula = ~1 +      Sex + random(Text) + random(Language) + random(Family) +  
    random(Speaker), family = NO(mu.link = "identity"),      data = d, control = gamlss.control(n.cyc = 800,          trace = FALSE), i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.14451    0.03463 1130.31   <2e-16 ***
Sex1        -1.01064    0.03463  -29.18   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.54554    0.01478  36.903  < 2e-16 ***
Sex1        -0.05935    0.01478  -4.015 6.17e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  2288 
Degrees of Freedom for the fit:  296.3836
      Residual Deg. of Freedom:  1991.616 
                      at cycle:  6 
 
Global Deviance:     8961.553 
            AIC:     9554.321 
            SBC:     11254.21 
******************************************************************

Random effects

μ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.41202 
Random effect parameter sigma_b: 0.660353 
Smoothing parameter lambda     : 2.30777 

Language

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 16.95235 
Random effect parameter sigma_b: 3.10086 
Smoothing parameter lambda     : 0.104779 

Family

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.01908253 
Random effect parameter sigma_b: 0.000233421 
Smoothing parameter lambda     : 18354000 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 165.2917 
Random effect parameter sigma_b: 3.39374 
Smoothing parameter lambda     : 0.0935855 
σ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.0004688834 
Random effect parameter sigma_b: 0.000282878 
Smoothing parameter lambda     : 11664800 

Language

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.73159 
Random effect parameter sigma_b: 0.157607 
Smoothing parameter lambda     : 37.8208 

Family

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.001453389 
Random effect parameter sigma_b: 0.00011236 
Smoothing parameter lambda     : 73935500 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 80.97495 
Random effect parameter sigma_b: 0.184921 
Smoothing parameter lambda     : 28.2979 

Modelling the relationship between SR and ID

Let’s model SR with ID as an additional predictor (fixed effect) interacting with Sex. N.B. In this case, we must drop Language as a random effect, since each language has, by definition, only one value of ID.

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = SR ~ 1 + ID * Sex + random(Text) +      random(Speaker) + random(Family), sigma.formula = ~1 +      ID + Sex + random(Text) + random(Speaker) + random(Family),  
    family = NO(mu.link = "identity"), data = d, control = gamlss.control(n.cyc = 800,          trace = FALSE), i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 11.970094   0.037525  318.987  < 2e-16 ***
ID          -0.888703   0.005992 -148.324  < 2e-16 ***
Sex1        -0.062504   0.037524   -1.666  0.09593 .  
ID:Sex1     -0.017079   0.005991   -2.851  0.00441 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.73603    0.10289  -7.153 1.18e-12 ***
ID          -0.08460    0.01694  -4.993 6.45e-07 ***
Sex1        -0.05723    0.01478  -3.871 0.000112 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  2288 
Degrees of Freedom for the fit:  293.3745
      Residual Deg. of Freedom:  1994.626 
                      at cycle:  8 
 
Global Deviance:     790.2025 
            AIC:     1376.951 
            SBC:     3059.581 
******************************************************************

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.001905342 
                       variance   =  1.000433 
               coef. of skewness  =  0.01947475 
               coef. of kurtosis  =  2.77394 
Filliben correlation coefficient  =  0.9993355 
******************************************************************


Deviance= 790.2025 

AIC= 1376.951 

Adding ID as a predictor improves the fits (as judged by AIC). There is a negative estimate for ID, but significance is difficult to assess with GAMLSS model involving smoothing functions. However, also using a simple lmer model we have a significant effect of ID:

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR ~ 1 + ID * Sex + (1 | Text) + (1 | Speaker) + (1 | Family)
   Data: info.rate.data

REML criterion at convergence: 2121.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.7806 -0.6226  0.0176  0.5808  5.1707 

Random effects:
 Groups   Name        Variance Std.Dev.
 Speaker  (Intercept) 0.4618   0.6795  
 Text     (Intercept) 0.0172   0.1311  
 Family   (Intercept) 0.2443   0.4942  
 Residual             0.1063   0.3260  
Number of obs: 2288, groups:  Speaker, 170; Text, 15; Family, 9

Fixed effects:
              Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)  10.521343   0.637420  42.244565  16.506  < 2e-16 ***
ID           -0.658948   0.101261  54.414768  -6.507 2.52e-08 ***
Sex1         -0.128232   0.367323 155.446204  -0.349    0.727    
ID:Sex1      -0.006959   0.060377 155.565771  -0.115    0.908    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
        (Intr) ID     Sex1  
ID      -0.959              
Sex1     0.002 -0.002       
ID:Sex1 -0.002  0.002 -0.990
Type III Analysis of Variance Table with Satterthwaite's method
       Sum Sq Mean Sq NumDF   DenDF F value    Pr(>F)    
ID     4.4998  4.4998     1  54.415 42.3465 2.516e-08 ***
Sex    0.0130  0.0130     1 155.446  0.1219    0.7275    
ID:Sex 0.0014  0.0014     1 155.566  0.0133    0.9084    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Random effects

μ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 14.41708 
Random effect parameter sigma_b: 0.11065 
Smoothing parameter lambda     : 82.1944 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 167.4872 
Random effect parameter sigma_b: 0.7577 
Smoothing parameter lambda     : 1.8794 

Family

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.01763365 
Random effect parameter sigma_b: 0.000168856 
Smoothing parameter lambda     : 35072900 

σ

Text

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.03217714 
Random effect parameter sigma_b: 0.00241144 
Smoothing parameter lambda     : 152754 

Speaker

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 104.4197 
Random effect parameter sigma_b: 0.242522 
Smoothing parameter lambda     : 15.8243 

Family

Random effects fit using the gamlss function random() 
Degrees of Freedom for the fit : 0.000711841 
Random effect parameter sigma_b: 0.000279392 
Smoothing parameter lambda     : 11379200 

Do Age and Sex matter?

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR ~ Age * Sex + (1 | Text) + (1 | Language)
   Data: info.rate.data

REML criterion at convergence: 3743.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.4947 -0.6305  0.0105  0.5978  3.6970 

Random effects:
 Groups   Name        Variance Std.Dev.
 Text     (Intercept) 0.01299  0.1140  
 Language (Intercept) 1.01328  1.0066  
 Residual             0.36283  0.6024  
Number of obs: 1979, groups:  Text, 15; Language, 14

Fixed effects:
              Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)  6.809e+00  2.753e-01  1.417e+01  24.734 4.61e-13 ***
Age         -5.972e-03  1.592e-03  1.951e+03  -3.751 0.000181 ***
Sex1        -1.084e-01  4.608e-02  1.948e+03  -2.351 0.018804 *  
Age:Sex1    -1.223e-03  1.440e-03  1.949e+03  -0.849 0.395965    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
         (Intr) Age    Sex1  
Age      -0.176              
Sex1      0.018 -0.107       
Age:Sex1 -0.019  0.115 -0.956
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: IR ~ Age * Sex + (1 | Text) + (1 | Family/Language)
   Data: info.rate.data

REML criterion at convergence: 10792.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.2256 -0.6274  0.0068  0.5901  4.6601 

Random effects:
 Groups          Name        Variance Std.Dev.
 Text            (Intercept)  0.4338  0.6586  
 Language:Family (Intercept)  8.4982  2.9152  
 Family          (Intercept)  2.1402  1.4629  
 Residual                    12.9843  3.6034  
Number of obs: 1979, groups:  Text, 15; Language:Family, 14; Family, 9

Fixed effects:
              Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)  3.996e+01  1.009e+00  1.049e+01  39.596 9.13e-13 ***
Age         -3.985e-02  9.517e-03  1.954e+03  -4.187 2.95e-05 ***
Sex1        -6.705e-01  2.756e-01  1.950e+03  -2.433   0.0151 *  
Age:Sex1    -7.286e-03  8.615e-03  1.950e+03  -0.846   0.3978    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
         (Intr) Age    Sex1  
Age      -0.285              
Sex1      0.028 -0.106       
Age:Sex1 -0.030  0.114 -0.956

So, it seems Age and Sex are both worth including in our models (even if we have to discard quite a bit of data because of missing Age info). (In fact, the effect of Age seems more significant than that of Sex.)

In the following, we investigate if Age does matter when using GAMLSS modelling…

GAMLSS with Age

Bescause there is missing data fro Age, and because the GAMLSS models require no missing data, we will fit the models with Age (and its interaction with Sex) on the subset of the data that contains only those speakers with Age info. To make comparability possible, we also fit the same models but without Age on the exact same subset of the data.

SR

Fixed or modelled σ (variance)
Fixed σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  -8.252732e-05 
                       variance   =  1.000506 
               coef. of skewness  =  0.04585 
               coef. of kurtosis  =  3.821224 
Filliben correlation coefficient  =  0.9973638 
******************************************************************

The model including Age * Sex is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = SR ~ 1 + Sex * Age + random(Text) +      random(Language) + random(Family) + random(Speaker),      family = NO(mu.link = "identity"), data = info.rate.data.for.age,      control = gamlss.control(n.cyc = 800, trace = FALSE),  
    i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.7419779  0.0234704 287.254  < 2e-16 ***
Sex1        -0.0916878  0.0234704  -3.907 9.71e-05 ***
Age         -0.0024831  0.0007319  -3.393 0.000707 ***
Sex1:Age    -0.0017244  0.0007319  -2.356 0.018566 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.1605     0.0159  -73.01   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  1979 
Degrees of Freedom for the fit:  161.6924
      Residual Deg. of Freedom:  1817.308 
                      at cycle:  35 
 
Global Deviance:     1022.798 
            AIC:     1346.183 
            SBC:     2250.099 
******************************************************************

The compared models are:

Model Deviance AIC
Age * Sex 1022.8 1346.2
Age + Sex 1022.8 1344.2
Sex 1022.8 1342.2

So, even if Age has a significant (negative) effect and interaction with Sex (positive for males), adding it does not seem to be warranted here…

Modelled σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.002353284 
                       variance   =  1.000499 
               coef. of skewness  =  0.02299299 
               coef. of kurtosis  =  2.90592 
Filliben correlation coefficient  =  0.9993823 
******************************************************************

The model including Age * Sex is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = SR ~ 1 + Sex * Age + random(Text) +      random(Language) + random(Family) + random(Speaker),      sigma.formula = ~1 + Sex * Age + random(Text) +          random(Language) + random(Family) + random(Speaker),  
    family = NO(mu.link = "identity"), data = info.rate.data.for.age,      control = gamlss.control(n.cyc = 800, trace = FALSE),      i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.5465239  0.0203601 321.536  < 2e-16 ***
Sex1        -0.0442321  0.0203601  -2.172     0.03 *  
Age          0.0038165  0.0006483   5.887 4.71e-09 ***
Sex1:Age    -0.0030913  0.0006483  -4.768 2.01e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.330351   0.053970 -24.650   <2e-16 ***
Sex1         0.019846   0.053970   0.368   0.7131    
Age          0.003029   0.001685   1.798   0.0724 .  
Sex1:Age    -0.002561   0.001685  -1.520   0.1286    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  1979 
Degrees of Freedom for the fit:  243.8639
      Residual Deg. of Freedom:  1735.136 
                      at cycle:  10 
 
Global Deviance:     721.7861 
            AIC:     1209.514 
            SBC:     2572.798 
******************************************************************

The compared models are:

Model Deviance AIC
Age * Sex 721.8 1209.5
Age + Sex 721.5 1206.1
Sex 721.3 1203

So, even if Age has a significant (negative) effect (but no interaction with Sex), adding it does not seem to be warranted here either…

The distribution of the residuals is less heteroscedastic than before and the fit to the data better.

Summary

Thus, for SR, even if there is a hint that Age might affect it negatively (and there might also be an interaction with Sex with a positive effect for males), overall, the various fit indices do not warrant its inclusion in the GAMLSS models.

IR

Fixed or modelled σ (variance)
Fixed σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  1.384527e-05 
                       variance   =  1.000506 
               coef. of skewness  =  0.09166289 
               coef. of kurtosis  =  3.764844 
Filliben correlation coefficient  =  0.9974381 
******************************************************************

The model including Age * Sex is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = IR ~ 1 + Sex * Age + random(Text) +      random(Language) + random(Family) + random(Speaker),      family = NO(mu.link = "identity"), data = info.rate.data.for.age,      control = gamlss.control(n.cyc = 800, trace = FALSE),  
    i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 40.015135   0.137507 291.003  < 2e-16 ***
Sex1        -0.175295   0.137508  -1.275    0.203    
Age         -0.035742   0.004288  -8.336  < 2e-16 ***
Sex1:Age    -0.023187   0.004288  -5.408 7.22e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.6074     0.0159   38.21   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  1979 
Degrees of Freedom for the fit:  161.827
      Residual Deg. of Freedom:  1817.173 
                      at cycle:  2 
 
Global Deviance:     8020.303 
            AIC:     8343.957 
            SBC:     9248.626 
******************************************************************

The compared models are:

Model Deviance AIC
Age * Sex 8020.3 8344
Age + Sex 8020.3 8342
Sex 8020.3 8340

So, even if Age has a significant (negative) effect and interaction with Sex (positive for males) – interestingly, in this case the main effect of Sex disappears –, adding it does not seem to be warranted…

Modelled σ

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.002387788 
                       variance   =  1.00051 
               coef. of skewness  =  0.02686615 
               coef. of kurtosis  =  2.9163 
Filliben correlation coefficient  =  0.9993496 
******************************************************************

The model including Age * Sex is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = IR ~ 1 + Sex * Age + random(Text) +      random(Language) + random(Family) + random(Speaker),      sigma.formula = ~1 + Sex * Age + random(Text) +          random(Language) + random(Family) + random(Speaker),  
    family = NO(mu.link = "identity"), data = info.rate.data.for.age,      control = gamlss.control(n.cyc = 800, trace = FALSE),      i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 40.041563   0.118455 338.031  < 2e-16 ***
Sex1        -0.299328   0.118455  -2.527   0.0116 *  
Age         -0.036894   0.003708  -9.950  < 2e-16 ***
Sex1:Age    -0.019096   0.003708  -5.150  2.9e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.479445   0.053990   8.880   <2e-16 ***
Sex1         0.025025   0.053990   0.464    0.643    
Age          0.001854   0.001685   1.100    0.271    
Sex1:Age    -0.002747   0.001685  -1.630    0.103    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  1979 
Degrees of Freedom for the fit:  244.3689
      Residual Deg. of Freedom:  1734.631 
                      at cycle:  10 
 
Global Deviance:     7726.653 
            AIC:     8215.391 
            SBC:     9581.498 
******************************************************************

The compared models are:

Model Deviance AIC
Age * Sex 7726.7 8215.4
Age + Sex 7726.4 8212.2
Sex 7725.8 8208.8

So, even if Age has a significant (negative) effect and interaction with Sex (positive for males) – interestingly, in this case the main effect of Sex disappears –, adding it does not seem to be warranted…

The distribution of the residuals is less heteroscedastic than before and the fit to the data better.

Summary

Thus, while for IR the hint that Age has a negative main effect and interacts with Sex (with a positive effect for males, containing the whole effect of Sex) is much stronger, the various fit indices do not warrant its inclusion in the GAMLSS models.

Modelling the relationship between SR and ID

******************************************************************
          Summary of the Quantile Residuals
                           mean   =  0.001744301 
                       variance   =  1.000502 
               coef. of skewness  =  0.008063776 
               coef. of kurtosis  =  2.834364 
Filliben correlation coefficient  =  0.9993747 
******************************************************************

The model including Age * Sex is:

******************************************************************
Family:  c("NO", "Normal") 

Call:  gamlss(formula = SR ~ 1 + ID + Sex * Age + random(Text) +      random(Speaker) + random(Family), sigma.formula = ~1 +      ID + Sex * Age + random(Text) + random(Speaker) +      random(Family), family = NO(mu.link = "identity"),  
    data = info.rate.data.for.age, control = gamlss.control(n.cyc = 800,          trace = FALSE), i.control = glim.control(bf.cyc = 800)) 

Fitting method: RS() 

------------------------------------------------------------------
Mu link function:  identity
Mu Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 12.6793628  0.0517065  245.218  < 2e-16 ***
ID          -0.9813533  0.0071683 -136.903  < 2e-16 ***
Sex1         0.0102517  0.0204390    0.502    0.616    
Age         -0.0059906  0.0006663   -8.991  < 2e-16 ***
Sex1:Age    -0.0050056  0.0006537   -7.658 3.12e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
Sigma link function:  log
Sigma Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.790248   0.133246  -5.931 3.63e-09 ***
ID          -0.086078   0.019128  -4.500 7.24e-06 ***
Sex1         0.046494   0.054368   0.855   0.3926    
Age          0.001961   0.001718   1.141   0.2540    
Sex1:Age    -0.003459   0.001698  -2.037   0.0418 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

------------------------------------------------------------------
NOTE: Additive smoothing terms exist in the formulas: 
 i) Std. Error for smoothers are for the linear effect only. 
ii) Std. Error for the linear terms maybe are not accurate. 
------------------------------------------------------------------
No. of observations in the fit:  1979 
Degrees of Freedom for the fit:  236.269
      Residual Deg. of Freedom:  1742.731 
                      at cycle:  8 
 
Global Deviance:     705.2126 
            AIC:     1177.751 
            SBC:     2498.577 
******************************************************************

The compared models are:

Model Deviance AIC
ID * Sex * Age 705.2 1188.4
ID + Sex * Age 705.2 1177.8
ID * Sex + Age 704.2 1178.2
ID + Sex + Age 704.5 1174.5
ID * Sex 704.4 1174.6
ID + Sex 704.6 1170.9

Clearly, adding Age is not warranted here (as is the interaction between ID and Sex)…

As above, we also looked a the simple lmer model:

The compared models are:

Model AIC
ID * Sex * Age 1811.2
ID + Sex * Age 1790.7
ID * Sex + Age 1786.3
ID + Sex + Age 1781
ID * Sex 1777.8
ID + Sex 1772.5

The best model is still the one not including Age:

Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: SR ~ 1 + ID + Sex + (1 | Text) + (1 | Speaker) + (1 | Family)
   Data: info.rate.data.for.age

REML criterion at convergence: 1758.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.8062 -0.6300  0.0174  0.5882  5.2084 

Random effects:
 Groups   Name        Variance Std.Dev.
 Speaker  (Intercept) 0.35876  0.5990  
 Text     (Intercept) 0.01494  0.1222  
 Family   (Intercept) 0.26668  0.5164  
 Residual             0.10575  0.3252  
Number of obs: 1979, groups:  Speaker, 132; Text, 15; Family, 9

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)  10.80777    0.77185  27.12524  14.002 6.21e-14 ***
ID           -0.70707    0.12477  31.07295  -5.667 3.15e-06 ***
Sex1         -0.14683    0.05268 119.92464  -2.787  0.00618 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
     (Intr) ID    
ID   -0.971       
Sex1 -0.001  0.003

Conclusions about Age

Within the limits of our reduced dataset (containing only 132 speakers with Age info), we found the following:

When modelling SR and IR with GAMLSS, while there are hints that Age has, for both, overall:

  • a negative main effect (i.e., older speakers of both sexes are slower and transmit less information per unit time across texts and languages), and
  • an interaction with Sex (positive for males, i.e., males are less negatively affected by age that females across texts and languages),

it does not seem warranted to include it in these models.

When modelling the relationship between SR and ID, this negative relationship:

  • seems strengthened with increasing Age (i.e., older speakers of both sexes shows a stronger negative relationship between SR and ID), and
  • there is an interaction with Sex (for females, this effect of Age is stronger than for males),

but, alas, the inclusion of Age is not warranted in the GAMLSS model, nor (really) in the simpler LMER model.

Thus, while Age seems to negatively influence (in a sex-dependent manner) both SR and IR, as well as strengthen the negative relationship between them, its effects are far from clear in the current dataset.

Mixture of Gaussians

In what follows, mixing probabilities are independent from factors such as Sex.

SR

Between 1 and 5 Gaussian distributions:

1 component


Mixing Family:  "NO" 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 1, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      6.631  
Sigma Coefficients for model: 1 
(Intercept)  
     0.1378  

Estimated probabilities: 1 

Degrees of Freedom for the fit: 2 Residual Deg. of Freedom   2286 
Global Deviance:     7123.61 
            AIC:     7127.61 
            SBC:     7139.08 

2 components


Mixing Family:  c("NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 2, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      7.149  
Sigma Coefficients for model: 1 
(Intercept)  
    -0.1855  
Mu Coefficients for model: 2 
(Intercept)  
      5.259  
Sigma Coefficients for model: 2 
(Intercept)  
    -0.4788  

Estimated probabilities: 0.7260515 0.2739485 

Degrees of Freedom for the fit: 5 Residual Deg. of Freedom   2283 
Global Deviance:     7001.22 
            AIC:     7011.22 
            SBC:     7039.9 

3 components


Mixing Family:  c("NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 3, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      7.106  
Sigma Coefficients for model: 1 
(Intercept)  
    -0.2769  
Mu Coefficients for model: 2 
(Intercept)  
      7.207  
Sigma Coefficients for model: 2 
(Intercept)  
     -0.133  
Mu Coefficients for model: 3 
(Intercept)  
       5.27  
Sigma Coefficients for model: 3 
(Intercept)  
    -0.4741  

Estimated probabilities: 0.3374419 0.3826253 0.2799327 

Degrees of Freedom for the fit: 8 Residual Deg. of Freedom   2280 
Global Deviance:     7000.8 
            AIC:     7016.8 
            SBC:     7062.69 

4 components


Mixing Family:  c("NO", "NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 4, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      7.784  
Sigma Coefficients for model: 1 
(Intercept)  
    -0.3468  
Mu Coefficients for model: 2 
(Intercept)  
      5.811  
Sigma Coefficients for model: 2 
(Intercept)  
    -0.2546  
Mu Coefficients for model: 3 
(Intercept)  
      7.124  
Sigma Coefficients for model: 3 
(Intercept)  
    -0.7752  
Mu Coefficients for model: 4 
(Intercept)  
      5.442  
Sigma Coefficients for model: 4 
(Intercept)  
    -0.3356  

Estimated probabilities: 0.2584335 0.2204151 0.2985522 0.2225992 

Degrees of Freedom for the fit: 11 Residual Deg. of Freedom   2277 
Global Deviance:     6990.94 
            AIC:     7012.94 
            SBC:     7076.03 

5 components


Mixing Family:  c("NO", "NO", "NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = SR ~ 1, family = NO, K = 5, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      6.396  
Sigma Coefficients for model: 1 
(Intercept)  
    -0.3948  
Mu Coefficients for model: 2 
(Intercept)  
      6.504  
Sigma Coefficients for model: 2 
(Intercept)  
    -0.3149  
Mu Coefficients for model: 3 
(Intercept)  
      5.132  
Sigma Coefficients for model: 3 
(Intercept)  
    -0.5414  
Mu Coefficients for model: 4 
(Intercept)  
      7.267  
Sigma Coefficients for model: 4 
(Intercept)  
     -1.037  
Mu Coefficients for model: 5 
(Intercept)  
      7.959  
Sigma Coefficients for model: 5 
(Intercept)  
    -0.4526  

Estimated probabilities: 0.2027298 0.180915 0.2175461 0.191181 0.2076281 

Degrees of Freedom for the fit: 14 Residual Deg. of Freedom   2274 
Global Deviance:     6981.9 
            AIC:     7009.9 
            SBC:     7090.2 

Comparing AIC

            df      AIC
mix.SR.NO.5 14 7009.904
mix.SR.NO.2  5 7011.223
mix.SR.NO.4 11 7012.941
mix.SR.NO.3  8 7016.804
mix.SR.NO.1  2 7127.609

Showing the distributions

Mixture of Gaussians for SR.

Mixture of Gaussians for SR.

IR

Between 1 and 5 Gaussian distributions:

1 component


Mixing Family:  "NO" 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 1, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      39.15  
Sigma Coefficients for model: 1 
(Intercept)  
      1.629  

Estimated probabilities: 1 

Degrees of Freedom for the fit: 2 Residual Deg. of Freedom   2286 
Global Deviance:     13945.2 
            AIC:     13949.2 
            SBC:     13960.7 

2 components


Mixing Family:  c("NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 2, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      41.27  
Sigma Coefficients for model: 1 
(Intercept)  
      1.863  
Mu Coefficients for model: 2 
(Intercept)  
      38.38  
Sigma Coefficients for model: 2 
(Intercept)  
      1.447  

Estimated probabilities: 0.266011 0.733989 

Degrees of Freedom for the fit: 5 Residual Deg. of Freedom   2283 
Global Deviance:     13894.8 
            AIC:     13904.8 
            SBC:     13933.5 

3 components


Mixing Family:  c("NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 3, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
         40  
Sigma Coefficients for model: 1 
(Intercept)  
     0.8341  
Mu Coefficients for model: 2 
(Intercept)  
       42.1  
Sigma Coefficients for model: 2 
(Intercept)  
      1.725  
Mu Coefficients for model: 3 
(Intercept)  
      36.09  
Sigma Coefficients for model: 3 
(Intercept)  
      1.395  

Estimated probabilities: 0.2444004 0.350947 0.4046526 

Degrees of Freedom for the fit: 8 Residual Deg. of Freedom   2280 
Global Deviance:     13875.8 
            AIC:     13891.8 
            SBC:     13937.6 

4 components


Mixing Family:  c("NO", "NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 4, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
         41  
Sigma Coefficients for model: 1 
(Intercept)  
      1.127  
Mu Coefficients for model: 2 
(Intercept)  
      39.56  
Sigma Coefficients for model: 2 
(Intercept)  
    0.08543  
Mu Coefficients for model: 3 
(Intercept)  
      36.14  
Sigma Coefficients for model: 3 
(Intercept)  
       1.38  
Mu Coefficients for model: 4 
(Intercept)  
      42.51  
Sigma Coefficients for model: 4 
(Intercept)  
      1.774  

Estimated probabilities: 0.222081 0.07674445 0.4389749 0.2621997 

Degrees of Freedom for the fit: 11 Residual Deg. of Freedom   2277 
Global Deviance:     13868.4 
            AIC:     13890.4 
            SBC:     13953.5 

5 components


Mixing Family:  c("NO", "NO", "NO", "NO", "NO") 

Fitting method: EM algorithm 

Call:  gamlssMX(formula = IR ~ 1, family = NO, K = 5, data = d, plot = FALSE) 

Mu Coefficients for model: 1 
(Intercept)  
      40.91  
Sigma Coefficients for model: 1 
(Intercept)  
      1.303  
Mu Coefficients for model: 2 
(Intercept)  
      34.42  
Sigma Coefficients for model: 2 
(Intercept)  
      1.252  
Mu Coefficients for model: 3 
(Intercept)  
      39.47  
Sigma Coefficients for model: 3 
(Intercept)  
     0.5634  
Mu Coefficients for model: 4 
(Intercept)  
      43.68  
Sigma Coefficients for model: 4 
(Intercept)  
      1.767  
Mu Coefficients for model: 5 
(Intercept)  
      39.64  
Sigma Coefficients for model: 5 
(Intercept)  
      1.289  

Estimated probabilities: 0.211754 0.2798107 0.1459857 0.1801956 0.182254 

Degrees of Freedom for the fit: 14 Residual Deg. of Freedom   2274 
Global Deviance:     13868.7 
            AIC:     13896.7 
            SBC:     13977 

Comparing AIC

            df      AIC
mix.IR.NO.4 11 13890.37
mix.IR.NO.3  8 13891.76
mix.IR.NO.5 14 13896.71
mix.IR.NO.2  5 13904.80
mix.IR.NO.1  2 13949.21

Showing the distributions

Mixture of Gaussians for IR.

Mixture of Gaussians for IR.

Tests of unimodality

We used three ways to estimate how unimodal a distribution is, as they tend to disagree and the problem of unimodality testing is far from settled (see Freeman & Dale, 2013):

  • the Silverman test tests the null hypothesis that an underlying density has at most k modes. The null hypothesis is that the underlying density has at most k modes (H0: number of modes <= k). The result is the p-value (bootstrapped) of rejecting a unimodal distribution. It is described in Silverman (1981) and Hall & York (2001); our implementation is based on the code available at https://www.mathematik.uni-marburg.de/~stochastik/R_packages/;
  • the dip test which computes Hartigans’ dip statistic D_n_, and its p-value for the test for unimodality, by interpolating tabulated quantiles of sqrt(n) • D_n_. For X_i_F, i.i.d., the null hypothesis is that F is a unimodal distribution. The result is the D metric and the p-value (interpolated) of rejecting a unimodal distribution. See Hartigan (1985) and Hartigan & Hartigan (1985); it is implemented in package diptest;
  • the bimodality coefficient (BC) which is based on an empirical relationship between bimodality and the third and fourth statistical moments of a distribution (skewness and kurtosis). It is proportional to the division of squared skewness with uncorrected kurtosis, BC ≈ (s2 + 1)/k, with the underlying logic that a bimodal distribution will have very low kurtosis, an asymmetric character, or both; all of these conditions increase BC. The values range from 0 and 1, with those exceeding .555 (the value representing a uniform distribution) suggesting bi-modality. The result is the DC estimate (which must exceed 0.555 to reject a unimodal distribution). We implemented it following Freeman & Dale (2013) as BC = (s2 + 1)/(k + 3 • ((n-1)2 / ((n-2) • (n-3)))).

For each such test, we performed four randomisation procedures to obtain an estimate of the “specialness” of the observed unimodality estimate; for each new permuted dataset, we recompute everything before estimating the unimodlaity of the permuted distribution:

  • Permutation model 1 (PM1): randomly permute the SR values freely among speakers, texts and languages;
  • Permutation model 2 (PM2): randomly permute the ID values among languages;
  • Permutation model 3 (PM3): randomly permute the Speaker average SR values among speakers (irrespective of languages);
  • Permutation model 4 (PM4): randomly permute the Language average SR values among languages.

Visual comparison

The observed estimate (vertical blue solid line), the permuted distribution (gray histogram), and the “unimodality region” (shaded green rectangle) are shown below (for PM3, we also show the original estimate using the Speaker average SR as a vertical solid red line).

PM1

**Permutation of the texts' SRs (PM1)**.

Permutation of the texts’ SRs (PM1).

PM2

**Permutation of the languages' ID (PM2)**.

Permutation of the languages’ ID (PM2).

PM3

**Permutation of the speakers' average SRs (PM3)**.

Permutation of the speakers’ average SRs (PM3).

PM4

**Permutation of the languages' average SRs with speaker adjustement (PM4)**.

Permutation of the languages’ average SRs with speaker adjustement (PM4).

Summary of unimodality tests

Summary of unimodality permutation tests.
Scenario Measure Test Observed estimate (p-value) % more unimodal permutations
PM1 SR Silverman - (0.017) 98%
PM1 SR Dip 0.005 (0.984) * 100%
PM1 SR BC 0.19 () * 100%
PM1 IR Silverman - (0.85) * 13.1%
PM1 IR Dip 0.005 (0.992) * 73.3%
PM1 IR BC 0.167 () * 100%
PM2 SR Silverman - (0.017) 98%
PM2 SR Dip 0.005 (0.984) * 100%
PM2 SR BC 0.19 () * 100%
PM2 IR Silverman - (0.85) * 1.9%
PM2 IR Dip 0.005 (0.992) * 26.5%
PM2 IR BC 0.167 () * 96.7%
PM3 SR Silverman - (0.017) 100%
PM3 SR Dip 0.005 (0.984) * 0%
PM3 SR BC 0.19 () * 100%
PM3 IR Silverman - (0.85) * 14%
PM3 IR Dip 0.005 (0.992) * 18.4%
PM3 IR BC 0.167 () * 99.9%
PM4 SR Silverman - (0.017) 6.6%
PM4 SR Dip 0.005 (0.984) * 1.8%
PM4 SR BC 0.19 () * 82.8%
PM4 IR Silverman - (0.85) * 0.9%
PM4 IR Dip 0.005 (0.992) * 14.6%
PM4 IR BC 0.167 () * 96.3%

Pair-wise distances between languages

We compute various distances between languages (as implemented by function distance() in package philentropy) in what concerns the distribution of NS, SR and ID.

Comparing the distribution of pairwise distances between languages.

Comparing the distribution of pairwise distances between languages.

Paired permutation t-tests comparing measures with 1,000 permutations.
m1 m2 d mean1 median1 sd1 mean2 median2 sd2 p
IR NS Hellinger 0.88 0.83 0.32 1.20 1.19 0.39 0.00
IR NS Jensen-Shannon 0.17 0.14 0.12 0.29 0.26 0.16 0.00
IR NS Kolmogorov–Smirnov 0.42 0.37 0.20 0.57 0.57 0.23 0.00
IR NS Kullback-Leibler 7.13 4.22 7.81 15.42 13.12 13.31 0.00
IR NS Squared-Chi 0.56 0.45 0.36 0.88 0.79 0.47 0.00
IR SR Hellinger 0.88 0.83 0.32 1.10 1.06 0.48 0.00
IR SR Jensen-Shannon 0.17 0.14 0.12 0.27 0.23 0.20 0.00
IR SR Kolmogorov–Smirnov 0.42 0.37 0.20 0.56 0.55 0.27 0.00
IR SR Kullback-Leibler 7.13 4.22 7.81 12.80 6.69 14.88 0.00
IR SR Squared-Chi 0.56 0.45 0.36 0.86 0.77 0.57 0.00
NS SR Hellinger 1.20 1.19 0.39 1.10 1.06 0.48 0.00
NS SR Jensen-Shannon 0.29 0.26 0.16 0.27 0.23 0.20 0.32
NS SR Kolmogorov–Smirnov 0.57 0.57 0.23 0.56 0.55 0.27 0.79
NS SR Kullback-Leibler 15.42 13.12 13.31 12.80 6.69 14.88 0.04
NS SR Squared-Chi 0.88 0.79 0.47 0.86 0.77 0.57 0.60

References

Campione, E., & Véronis, J. (1998). A multilingual prosodic database, Proc. of the 5th International Conference on Spoken Language Pro cessing (ICSLP’98), Sydney, Australia, 3163-3166.

Freeman, J. B., & Dale, R. (2013). Assessing bimodality to detect the presence of a dual cognitive process. Behavior research methods, 45(1), 83-97.

Hall, P., & York, M. (2001). On the calibration of Silverman’s test for multimodality. Statistica Sinica, 11, 515-536.

Hartigan, J. A., & Hartigan, P. M. (1985) The Dip Test of Unimodality. Annals of Statistics 13, 70–84.

Hartigan, P. M. (1985) Computation of the Dip Statistic to Test for Unimodality. Applied Statistics (JRSS C) 34, 320–325.

Le, V. B., Tran, D. D., Castelli, E., Besacier, L., & Serignat, J. F. (2004). Spoken and Written Language Resources for Vietnamese. In LREC. 4, pp. 599-602.

Lyding, V., Stemle, E., Borghetti, C., Brunello, M., Castagnoli, S., Dell’Orletta, F., Dittmann, H., Lenci, A., & Pirrelli, V. (2014). The PAISÀ Corpus of Italian Web Texts. In Proceedings of the 9th Web as Corpus Workshop (WaC-9). Association for Computational Linguistics, Gothenburg, Sweden, 36-43.

New B., Pallier C., Ferrand L., & Matos R. (2001). Une base de données lexicales du français contemporain sur internet: LEXIQUE 3.80, L’Année Psychologique, 101, 447-462. http://www.lexique.org.

Oh, Y. M. (2015). Linguistic complexity and information: quantitative approaches. PhD Thesis, Université de Lyon, France. Retrieved from http://www.afcp-parole.org/doc/theses/these_YMO15.pdf

Perea, M., Urkia, M., Davis, C. J., Agirre, A., Laseka, E., & Carreiras, M. (2006). E-Hitz: A word frequency list and a program for deriving psycholinguistic statistics in an agglutinative language (Basque). Behavior Research Methods, 38(4), 610-615.

Sharoff, S. (2006). Creating general-purpose corpora using automated search engine queries. In Baroni, M. and Bernardini, S. (Eds.) WaCky! Working papers on the web as corpus, Gedit, Bologna, http://corpus.leeds.ac.uk/queryzh.html.

Silverman, B.W. (1981). Using Kernel Density Estimates to investigate Multimodality. Journal of the Royal Statistical Society, Series B, 43, 97-99.

Váradi, T. (2002). The Hungarian National Corpus. In LREC.

Zséder, A., Recski, G., Varga, D., & Kornai, A. (2012). Rapid creation of large-scale corpora and frequency dictionaries. In Proceedings to LREC 2012.

Appendix I: R session info

This document was compiled on:

R version 3.4.4 (2018-03-15)

**Platform:** x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C

attached base packages: grid, parallel, splines, stats, graphics, grDevices, datasets, utils, methods and base

other attached packages: ggpubr(v.0.2), magrittr(v.1.5), ggrepel(v.0.8.0), RColorBrewer(v.1.1-2), dplyr(v.0.8.0.1), ggridges(v.0.5.1), broman(v.0.68-2), philentropy(v.0.3.0), pander(v.0.6.3), moments(v.0.14), sjPlot(v.2.6.2), sjstats(v.0.17.3), gamlss.mx(v.4.3-5), nnet(v.7.3-12), gamlss(v.5.1-2), nlme(v.3.1-137), gamlss.dist(v.5.1-1), MASS(v.7.3-51.1), gamlss.data(v.5.1-0), lmerTest(v.3.1-0), lme4(v.1.1-20), Matrix(v.1.2-15), plyr(v.1.8.4), reshape2(v.1.4.3), ggplot2(v.3.1.0) and RhpcBLASctl(v.0.18-205)

loaded via a namespace (and not attached): numDeriv(v.2016.8-1), tools(v.3.4.4), TMB(v.1.7.15), backports(v.1.1.3), R6(v.2.4.0), sjlabelled(v.1.0.16), lazyeval(v.0.2.1), colorspace(v.1.4-0), withr(v.2.1.2), tidyselect(v.0.2.5), mnormt(v.1.5-5), emmeans(v.1.3.2), compiler(v.3.4.4), sandwich(v.2.5-0), labeling(v.0.3), scales(v.1.0.0), mvtnorm(v.1.0-8), psych(v.1.8.12), stringr(v.1.4.0), digest(v.0.6.18), foreign(v.0.8-71), minqa(v.1.2.4), rmarkdown(v.1.11), stringdist(v.0.9.5.1), pkgconfig(v.2.0.2), htmltools(v.0.3.6), highr(v.0.7), pwr(v.1.2-2), rlang(v.0.3.1), generics(v.0.0.2), zoo(v.1.8-4), modeltools(v.0.2-22), bayesplot(v.1.6.0), Rcpp(v.1.0.0), munsell(v.0.5.0), prediction(v.0.3.6.2), stringi(v.1.3.1), multcomp(v.1.4-8), yaml(v.2.2.0), snakecase(v.0.9.2), sjmisc(v.2.7.7), forcats(v.0.4.0), crayon(v.1.3.4), lattice(v.0.20-38), ggeffects(v.0.8.0), haven(v.2.1.0), hms(v.0.4.2), knitr(v.1.21), pillar(v.1.3.1), estimability(v.1.3), codetools(v.0.2-16), stats4(v.3.4.4), glue(v.1.3.0), evaluate(v.0.13), data.table(v.1.12.0), modelr(v.0.1.4), nloptr(v.1.2.1), gtable(v.0.2.0), purrr(v.0.3.0), tidyr(v.0.8.2), assertthat(v.0.2.0), xfun(v.0.5), coin(v.1.2-2), xtable(v.1.8-3), broom(v.0.5.1), coda(v.0.19-2), survival(v.2.43-3), tibble(v.2.0.1), glmmTMB(v.0.2.3) and TH.data(v.1.0-10)

Appendix II: Figures for the main paper

Here we generate the figures used in the main paper (saved to the ./figures folder as 600 DPI TIFF files Figure-*.tiff).